{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Final polishing stage for recordmeta\n",
    "\n",
    "This is the final polishing stage for ```recordmeta.tsv.``` It reconciles the probabilistic columns ```juvenileprob``` and ```nonficprob``` with ground truth. (There's no point relying on \"predictions\" about volumes where we actually know the truth.)\n",
    "\n",
    "Then it corrects docids, to deal with errors and update some ids that have changed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### read in existing recordmeta, enriched by make_predictions.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['oldauthor', 'author', 'authordate', 'inferreddate', 'latestcomp', 'datetype', 'startdate', 'enddate', 'imprint', 'imprintdate', 'contents', 'genres', 'subjects', 'geographics', 'locnum', 'oclc', 'place', 'recordid', 'enumcron', 'volnum', 'title', 'parttitle', 'shorttitle', 'instances', 'juvenileprob', 'nonficprob']\n"
     ]
    }
   ],
   "source": [
    "record = pd.read_csv('../enrichedrecordmeta.tsv', sep = '\\t', index_col = 'docid', low_memory = False)\n",
    "print(record.columns.tolist())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### read in ground truth"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "ground = pd.read_csv('../manuallists/union_of_subsets.csv', index_col = 'docid', low_memory = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### make corrections\n",
    "\n",
    "Specifically, we check each index in recordmeta against ground. If it matches, we check the category listed in ground, and set the ```juvenileprob``` and ```nonficprob``` to zero or one as appropriate.\n",
    "\n",
    "If the predicted probabilities are not null, we add them to a list that we're going to use to estimate averages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "yesnon = []\n",
    "nonon = []\n",
    "unknon = []\n",
    "yesjuv = []\n",
    "nojuv = []\n",
    "unkjuv = []\n",
    "\n",
    "for idx in record.index:\n",
    "    juvpredict = record.loc[idx, 'juvenileprob']\n",
    "    nonpredict = record.loc[idx, 'nonficprob']\n",
    "    \n",
    "    if idx not in ground.index and not pd.isnull(juvpredict):\n",
    "        unkjuv.append(juvpredict)\n",
    "    elif idx in ground.index:\n",
    "        truecat = ground.loc[idx, 'category']\n",
    "        if truecat == 'juvenile':\n",
    "            if not pd.isnull(juvpredict):\n",
    "                yesjuv.append(juvpredict)\n",
    "            record.loc[idx, 'juvenileprob'] = 1.0\n",
    "        else:\n",
    "            if not pd.isnull(juvpredict):\n",
    "                nojuv.append(juvpredict)\n",
    "            record.loc[idx, 'juvenileprob'] = 0.0\n",
    "    \n",
    "    if idx not in ground.index and not pd.isnull(nonpredict):\n",
    "        unknon.append(nonpredict)\n",
    "    elif idx in ground.index:\n",
    "        truecat = ground.loc[idx, 'category']\n",
    "        if truecat == 'notfiction':\n",
    "            if not pd.isnull(nonpredict):\n",
    "                yesnon.append(nonpredict)\n",
    "            record.loc[idx, 'nonficprob'] = 1.0\n",
    "        else:\n",
    "            if not pd.isnull(nonpredict):\n",
    "                nonon.append(nonpredict)\n",
    "            record.loc[idx, 'nonficprob'] = 0.0\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### some EDA\n",
    "\n",
    "I was just curious what predicted probabilities actually are, on average, for volumes that we *know* to be nonfiction (or know not to be) compared with those we don't know."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.791014044048\n",
      "0.259495901629\n",
      "0.207118329161\n"
     ]
    }
   ],
   "source": [
    "print(np.mean(yesnon))\n",
    "print(np.mean(unknon))\n",
    "print(np.mean(nonon))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The upshot is that our model is pretty confident about the examples known to be nonfiction (```yesnon```), but doesn't place the average volume known to be fiction (```nonon```) much lower than the average unknown volume. This makes sense, as most unknown volumes are indeed fiction.\n",
    "\n",
    "The same pattern holds for juvenile fiction. The last two numbers in this sequence are much closer to each other than the middle number is to the first:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.956455906831\n",
      "0.158700985652\n",
      "0.122038997642\n"
     ]
    }
   ],
   "source": [
    "print(np.mean(yesjuv))\n",
    "print(np.mean(unkjuv))\n",
    "print(np.mean(nojuv))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### some unimportant error checking"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "errorcount = 0\n",
    "for idx in record.index:\n",
    "    if ':' in idx:\n",
    "        errorcount += 1\n",
    "errorcount"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "errorcount = 0\n",
    "for idx in ground.index:\n",
    "    if idx not in record.index:\n",
    "        errorcount += 1\n",
    "errorcount"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "trans = pd.read_csv('../data/filename_translator.tsv', sep = '\\t', index_col = 'badname')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>goodname</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>badname</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>uc1.b250374</th>\n",
       "      <td>uc1.$b250374</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uc1.b250174</th>\n",
       "      <td>uc1.$b250174</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uc1.b249745</th>\n",
       "      <td>uc1.$b249745</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uc1.b70887</th>\n",
       "      <td>uc1.$b70887</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uc1.b182851</th>\n",
       "      <td>uc1.$b182851</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 goodname\n",
       "badname                  \n",
       "uc1.b250374  uc1.$b250374\n",
       "uc1.b250174  uc1.$b250174\n",
       "uc1.b249745  uc1.$b249745\n",
       "uc1.b70887    uc1.$b70887\n",
       "uc1.b182851  uc1.$b182851"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trans.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "70"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "errorcount = 0\n",
    "for idx in record.index:\n",
    "    if idx in trans.index:\n",
    "        errorcount += 1\n",
    "errorcount"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note that** we have not yet corrected bad docids.\n",
    "\n",
    "Write the corrected file to disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "record.to_csv('../recordmeta.tsv', sep = '\\t', index_label = 'docid')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
